Abstract. When modeling the distribution of a set of data by a mixture of Gaussians, there are two
possibilities: i) the classical one uses a set of parameters, namely the proportions, the means
and the variances; ii) the second considers the proportions as the probabilities of a
discrete-valued hidden variable. In the first case, a usual prior distribution for the proportions is
the Dirichlet, which accounts for the fact that they have to sum to one. In the second case, a hidden
variable is associated with each datum, and we consider two possibilities: a) assuming those
variables to be i.i.d., in which case we show that this scheme is equivalent to the classical mixture
model with a Dirichlet prior; b) assuming a Markovian structure, in which case we choose the simplest
Markovian model, the Potts distribution. As we will see, this model is more appropriate when the
data represent the pixels of an image, for which the hidden variables represent a segmentation of
that image. The main object of this paper is to give some details on these models and on the different
algorithms used for their simulation and the estimation of their parameters.


INTRODUCTION

When modeling the distribution of a set of data $x = \{x_i, i = 1, \dots, N\}$ by a mixture of
Gaussians (MoG), there are two possibilities:

i) The classical one uses a set of parameters, namely the proportions
$\alpha = \{\alpha_k, k = 1, \dots, K\}$, the means $\mu = \{\mu_k, k = 1, \dots, K\}$ and the
variances $v = \{v_k, k = 1, \dots, K\}$:

$$p(x_i \mid \theta, K) = \sum_{k=1}^{K} \alpha_k \, \mathcal{N}(x_i \mid \mu_k, v_k) \qquad (1)$$

and the objective is the estimation of $K$ and the parameters $\theta = \{\alpha, \mu, v\}$.

ii) The second considers the proportions as the probabilities of a discrete-valued
hidden variable $Z$ with $\alpha_k = P(Z = k)$:

$$p(x_i \mid \theta, K) = \sum_{k=1}^{K} P(Z = k) \, \mathcal{N}(x_i \mid \mu_k, v_k) \qquad (2)$$

which implies that $p(x \mid Z = k) = \mathcal{N}(x \mid \mu_k, v_k)$.



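In the first case, a usual prior distribution for $\alpha = \{\alpha_k, k = 1, \dots, K\}$ is the
Dirichlet distribution with hyperparameters $a_k > 0$,

$$\mathcal{D}(\alpha \mid a_1, \dots, a_K) = \frac{\Gamma\left(\sum_{k} a_k\right)}{\prod_{k} \Gamma(a_k)} \prod_{k=1}^{K} \alpha_k^{a_k - 1} \qquad (3)$$

which accounts for the fact that $\sum_k \alpha_k = 1$.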

In the second case, a discrete-valued hidden variable $Z_i$ is associated with each datum.
The value $z_i \in \{1, \dots, K\}$ taken by $Z_i$ is then the class label of the datum $x_i$. When
$x = \{x_i, i = 1, \dots, N\}$ represents the pixels of an image, $z = \{z_i, i = 1, \dots, N\}$
represents its segmentation. Then, naturally, we consider two possibilities for the distribution of
$z$: a) assuming the variables $Z_i$ to be i.i.d.; b) assuming that there is a spatial structure
through the image pixel index $i$ and thus assigning them a Potts Markov distribution.

In this paper we give some details on these models and different algorithms used for 
their simulation and the estimation of their parameters. 



MAXIMUM LIKELIHOOD AND BAYESIAN APPROACHES 

In model (1), the classical maximum likelihood (ML) method assumes that the data
$x = \{x_i, i = 1, \dots, n\}$ are i.i.d. samples from (1) and thus

$$L(x \mid \theta, K) = \prod_{i=1}^{n} p(x_i) = \prod_{i=1}^{n} \sum_{k=1}^{K} \alpha_k \, \mathcal{N}(x_i \mid \mu_k, v_k). \qquad (4)$$

Then, for a given $K$, the objective is the estimation of
$\theta = \{(\alpha_k, \mu_k, v_k), k = 1, \dots, K\}$, which is defined as

$$\hat{\theta} = \arg\max_{\theta} \{ L(x \mid \theta) \}. \qquad (5)$$

It is important to note that the likelihood can become degenerate, in the sense that it
may become unbounded for particular sets of parameters and data [1]. This makes the
estimation of the parameters by this approach difficult, which is why many authors have
proposed penalized likelihood criteria, where the role of the penalization term is to
eliminate this degeneracy [2, 3, 4].
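
The degeneracy can be observed numerically. The following is a minimal sketch, assuming a univariate two-component mixture; the toy data, the function names `normal_pdf` and `mog_log_likelihood`, and the chosen variances are illustrative and not taken from the paper. Centring one component on a single datum and shrinking its variance makes the log of (4) grow without bound.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mog_log_likelihood(data, props, means, variances):
    """ln L(x | theta, K) = sum_i ln sum_k alpha_k N(x_i | mu_k, v_k), as in (4)."""
    total = 0.0
    for x in data:
        total += math.log(sum(a * normal_pdf(x, m, v)
                              for a, m, v in zip(props, means, variances)))
    return total

# Hypothetical toy data; the first component is centred exactly on the first datum.
data = [0.0, 1.2, 0.8, 3.1, 2.9]
props = [0.5, 0.5]
means = [data[0], 2.0]

# Shrinking v_1 towards zero increases the log-likelihood at every step.
log_liks = [mog_log_likelihood(data, props, means, [v1, 1.0])
            for v1 in (1e-2, 1e-4, 1e-6)]
```

In this sketch the second component keeps every other datum at a finite density, so only the term of the first datum diverges, which is the mechanism behind the unbounded likelihood.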

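In the Bayesian approach, one assigns a prior $\pi(\theta)$, finds the expression of the posterior

$$p(\theta \mid x) \propto L(x \mid \theta) \, \pi(\theta) \qquad (6)$$

and then defines an estimate $\hat{\theta}$ either as the MAP estimate

$$\hat{\theta} = \arg\max_{\theta} \{ L(x \mid \theta) \, \pi(\theta) \} \qquad (7)$$

or as the posterior mean (PM)

$$\hat{\theta} = \int \theta \, p(\theta \mid x) \, d\theta = \frac{\int \theta \, L(x \mid \theta) \, \pi(\theta) \, d\theta}{\int L(x \mid \theta) \, \pi(\theta) \, d\theta}. \qquad (8)$$
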

The choice of the prior $\pi(\theta)$ in the Bayesian approach for the MoG model has been
studied by many Bayesian authors, through either entropic or conjugate priors.
Both approaches lead to the same prior, at least for the proportion parameters $\alpha_k$,
namely the Dirichlet prior (3). The conjugate priors for the means are Gaussian

$$\pi(\mu_k) = \mathcal{N}(\mu_k \mid \mu_0, v_0) \qquad (9)$$

and those for the variances are Inverse Gamma (IG):

$$\pi(v_k) = \mathcal{IG}(v_k \mid \alpha_0, \beta_0). \qquad (10)$$

It is also interesting to note that using the IG prior for the variances in the MAP
estimate yields exactly the penalization term that the ML approach needs in order to
eliminate the degeneracy of the likelihood.
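
The effect of this penalization can be checked on the term that causes the degeneracy. The sketch below is an illustration under stated assumptions: it looks only at a single datum sitting exactly on its component mean, it uses the shape/scale parameterization of the Inverse Gamma density, and the hyperparameter values `a=2.0`, `b=1.0` and all function names are hypothetical.

```python
import math

def log_normal(x, mean, var):
    """ln N(x | mean, var) for a univariate Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_inverse_gamma(v, a, b):
    """ln IG(v | a, b) with shape a and scale b (assumed parameterization)."""
    return a * math.log(b) - math.lgamma(a) - (a + 1.0) * math.log(v) - b / v

def penalized_term(v, a=2.0, b=1.0):
    """Log-likelihood of one datum on its component mean, plus the ln IG prior on v."""
    return log_normal(0.0, 0.0, v) + log_inverse_gamma(v, a, b)

variances = (1e-1, 1e-2, 1e-3)
unpenalized = [log_normal(0.0, 0.0, v) for v in variances]   # grows as v -> 0
penalized = [penalized_term(v) for v in variances]           # decreases as v -> 0
```

The term $-b/v$ of the log-prior dominates the $-\frac{1}{2}\ln v$ growth of the log-likelihood, so the penalized criterion no longer rewards a vanishing variance.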

Computing the ML solution (5) or the MAP solution (7) can be done either directly
or through an EM algorithm, but the PM solution (8) cannot be obtained analytically
and requires Monte Carlo (MC) algorithms. It is interesting to note that both the EM
algorithm and the MC sampling methods introduce the notion of hidden variables,
which is the subject of the second modeling case.
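
As an illustration of how the hidden variables enter the EM algorithm, the following is a minimal sketch of the standard EM updates for a univariate MoG with fixed $K$. It is not taken from the paper: the function `em_step`, the toy data and the initial values are assumptions, and no penalization is included.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_step(data, props, means, variances):
    """One EM iteration for a univariate mixture of K Gaussians."""
    K, n = len(props), len(data)
    # E step: responsibilities resp[i][k] = P(Z_i = k | x_i, theta)
    resp = []
    for x in data:
        w = [props[k] * normal_pdf(x, means[k], variances[k]) for k in range(K)]
        s = sum(w)
        resp.append([wk / s for wk in w])
    # M step: responsibility-weighted proportions, means and variances
    nk = [sum(r[k] for r in resp) for k in range(K)]
    new_props = [nk[k] / n for k in range(K)]
    new_means = [sum(r[k] * x for r, x in zip(resp, data)) / nk[k] for k in range(K)]
    new_vars = [sum(r[k] * (x - new_means[k]) ** 2 for r, x in zip(resp, data)) / nk[k]
                for k in range(K)]
    return new_props, new_means, new_vars

# Hypothetical data: two well-separated groups around -2 and +2.
data = [-2.1, -1.9, -2.0, -1.8, -2.2, 1.9, 2.1, 2.0, 2.2, 1.8]
theta = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
for _ in range(20):
    theta = em_step(data, *theta)
props, means, variances = theta
```

The responsibilities computed in the E step are the posterior probabilities of the labels $Z_i$, which is where the hidden variable of model (2) appears even when one starts from model (1).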



SEPARABLE (DIRICHLET) AND MARKOVIAN (POTTS) 
MODELS FOR THE HIDDEN VARIABLES 

In model (2), a hidden variable $Z_i$ is associated with each datum $x_i$, and the assumption
is that $x_i$ is a sample from $p(x_i \mid Z_i = k) = \mathcal{N}(x_i \mid \mu_k, v_k), \forall i$,
where the $Z_i$ can only take the values $k = 1, \dots, K$.

Then, if we assume the $Z_i$ to be independent and identically distributed (i.i.d.):

$$P(Z_i = k) = \alpha_k, \ \forall i \quad \text{and} \quad P(Z_i = k, Z_j = l) = \alpha_k \, \alpha_l, \ \forall i \neq j \qquad (11)$$

we can write

$$P(Z = z \mid \alpha, K) = \prod_{i} \alpha_{z_i} = \prod_{k} \alpha_k^{\sum_i \delta(z_i - k)} \qquad (12)$$


which means that $Z$ is separable in the $Z_i$. We can then find a link between the two models
(1) and (2), which become equivalent with $\alpha_k = \frac{1}{N} \sum_i \delta(z_i - k)$.
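
The two forms of the separable label probability can be compared directly. The sketch below is only an illustration: the proportions, the label vector and the function names are hypothetical, and labels are coded from 0 to $K-1$ rather than from 1 to $K$.

```python
import math
from collections import Counter

def log_prob_labels_pointwise(z, props):
    """ln prod_i alpha_{z_i}: one factor per datum."""
    return sum(math.log(props[zi]) for zi in z)

def log_prob_labels_counts(z, props):
    """ln prod_k alpha_k^{n_k}, with n_k the number of labels equal to k."""
    counts = Counter(z)
    return sum(n_k * math.log(props[k]) for k, n_k in counts.items())

props = [0.2, 0.5, 0.3]          # hypothetical proportions alpha_k
z = [0, 1, 1, 2, 1, 0, 2, 1]     # hypothetical labels z_i
lp_pointwise = log_prob_labels_pointwise(z, props)
lp_counts = log_prob_labels_counts(z, props)
```

Only the counts $n_k = \sum_i \delta(z_i - k)$ matter in the i.i.d. case, so any permutation of the labels has the same probability; this is precisely what the Potts model changes.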

But if we assume that there is some structure (dependency) in the hidden variables,
then we have to model it. The simplest model for such a structure is the Potts model:

$$P(Z_i = z_i \mid Z_j = z_j, j \neq i) \propto \exp\Big\{ \gamma \sum_{j \in \mathcal{V}(i)} \delta(z_i - z_j) \Big\} \qquad (13)$$

where $\mathcal{V}(i)$ represents the neighboring elements of $i$, for example
$\mathcal{V}(i) = \{i - 1\}$ or $\mathcal{V}(i) = \{i - 1, i + 1\}$ or, in cases where $i$
represents the index of a pixel in an image, the four nearest neighbors of that pixel, and
$\gamma$ is the Potts parameter.



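Using the equivalence of Gibbs and Markovian distributions, we can also write

$$\pi(z_i \mid z_j, j \in \mathcal{V}(i), \gamma, K) \propto \exp\Big\{ \gamma \sum_{j \in \mathcal{V}(i)} \delta(z_i - z_j) \Big\} \qquad (14)$$

where $\pi(z_i)$ stands in short for $P(Z_i = z_i)$ and $\pi(z)$ stands in short for $P(Z = z)$.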
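
For an image with the four-nearest-neighbor system, the conditional probability above only requires counting the neighbors that share each candidate label. A minimal sketch follows, with a hypothetical 3 x 3 label image, labels coded from 0 to $K-1$, and a function name `potts_conditional` chosen for illustration.

```python
import math

def potts_conditional(labels, row, col, K, gamma):
    """P(Z_i = k | neighbours) for k = 0..K-1 on a 4-neighbour grid, following (13)."""
    n_rows, n_cols = len(labels), len(labels[0])
    neighbours = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    weights = []
    for k in range(K):
        # Number of in-image neighbours whose label equals k.
        agree = sum(1 for r, c in neighbours
                    if 0 <= r < n_rows and 0 <= c < n_cols and labels[r][c] == k)
        weights.append(math.exp(gamma * agree))
    total = sum(weights)
    return [w / total for w in weights]

labels = [[0, 0, 1],
          [0, 1, 1],
          [0, 0, 1]]
# Centre pixel: three neighbours carry label 0 and one carries label 1.
probs = potts_conditional(labels, 1, 1, K=2, gamma=1.0)
```

With $\gamma = 0$ the conditional becomes uniform and the spatial structure disappears; larger values of $\gamma$ favor more homogeneous regions.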



DATA CLASSIFICATION AND IMAGE SEGMENTATION 

These two models have been used in many data classification or image segmentation
problems, where $x_i$ represents either the grey level or the color components of the pixel $i$
and $z_i$ its class label. The main objective of an image segmentation algorithm is the estimation
of $z$. When the hyperparameters $K$, $\theta = \{(\alpha_k, \mu_k, v_k), k = 1, \dots, K\}$ and
$\gamma$ are not known and also have to be estimated, we say that we are in totally unsupervised
mode; when they are known, we are in totally supervised mode; and when some of those
hyperparameters are fixed, we are in partially supervised mode. A classical case is the one with
fixed $K$.

Assuming first that $K$ is known, we can write the following:

$$p(x \mid z, \theta, K) = \prod_{i \in \mathcal{R}} p(x_i \mid z_i, \theta) = \prod_{i \in \mathcal{R}} \mathcal{N}(x_i \mid \mu_{z_i}, v_{z_i}) = \prod_{k} \prod_{i \in \mathcal{R}_k} \mathcal{N}(x_i \mid \mu_k, v_k) \qquad (15)$$

where $\mathcal{R} = \{1, \dots, n\}$ represents the set of all samples (all pixel positions of an
image) and $\mathcal{R}_k = \{i : z_i = k\}$ represents the set of all samples which have the same
label value $z_i = k$. Evidently, we assume that $\cup_k \mathcal{R}_k = \mathcal{R}$, which means
that all samples are classified. The joint posterior of $z$ and $\theta$ is then

$$p(z, \theta \mid x, K, \gamma) \propto p(x \mid z, \theta, K) \, \pi(z \mid \gamma, K) \, \pi(\theta). \qquad (16)$$



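Then, one can try to estimate both $z$ and $\theta$ from this expression, either by alternate
maximization:

$$\hat{z} = \arg\max_{z} \{ p(z, \hat{\theta} \mid x, K, \gamma) \}, \qquad \hat{\theta} = \arg\max_{\theta} \{ p(\hat{z}, \theta \mid x, K, \gamma) \} \qquad (17)$$

or by first estimating $\theta$ and then using it for the estimation of $z$:

$$\hat{\theta} = \arg\max_{\theta} \{ p(\theta \mid x, K, \gamma) \} \longrightarrow \hat{z} = \arg\max_{z} \{ p(z \mid \hat{\theta}, x, K, \gamma) \}. \qquad (18)$$

However, the first step of this second approach cannot be done explicitly and needs an
iterative algorithm using the hidden variables $z$ as the missing data. The Bayesian EM
algorithm has been developed in particular for this:

$$\begin{cases} \text{E step:} & Q(\theta \mid \theta^{(t)}) = \mathrm{E}\{ \ln p(Z, \theta \mid x, K, \gamma) \mid \theta^{(t)} \} \\ \text{M step:} & \theta^{(t+1)} = \arg\max_{\theta} \{ Q(\theta \mid \theta^{(t)}) \} \end{cases} \qquad (19)$$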



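The full Bayesian approach consists in exploring the whole posterior probability
distribution by generating samples from it. This can be done through a Gibbs sampling
algorithm:

$$\begin{cases} z \sim p(z \mid \theta, x, K, \gamma) \\ \theta \sim p(\theta \mid z, x, K, \gamma) \end{cases} \qquad (20)$$

where

$$p(z \mid \theta, x, K, \gamma) \propto p(x \mid \theta, z, K, \gamma) \, \pi(z \mid \gamma, K) \qquad (21)$$

and

$$p(\theta \mid z, x, K, \gamma) \propto p(x \mid \theta, z, K, \gamma) \, \pi(\theta) \qquad (22)$$

where $\pi(z \mid \gamma, K)$ is given either by (24) or by (25).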
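
The label step of this sampler can be sketched for the Potts case. The code below is an illustration under explicit assumptions rather than the authors' implementation: $\theta$ and $\gamma$ are held fixed (supervised mode), the pixels are visited one at a time with a four-neighbor system, labels are coded from 0 to $K-1$, and the toy image, the seed and the function name `gibbs_sweep_labels` are hypothetical. The $\theta$ step, which would draw from the conjugate posteriors, is omitted.

```python
import math
import random

def gibbs_sweep_labels(x, z, means, variances, gamma, rng):
    """One in-place Gibbs sweep over the labels z of a 2-D image x, theta fixed.

    Each z_i is drawn from p(z_i | z_-i, x_i, theta), proportional to
    N(x_i | mu_k, v_k) * exp(gamma * number of 4-neighbours with label k).
    """
    n_rows, n_cols, K = len(x), len(x[0]), len(means)
    for r in range(n_rows):
        for c in range(n_cols):
            log_w = []
            for k in range(K):
                agree = sum(1 for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                            if 0 <= rr < n_rows and 0 <= cc < n_cols and z[rr][cc] == k)
                log_lik = (-0.5 * math.log(2.0 * math.pi * variances[k])
                           - 0.5 * (x[r][c] - means[k]) ** 2 / variances[k])
                log_w.append(log_lik + gamma * agree)
            top = max(log_w)                      # subtract the max for numerical stability
            w = [math.exp(v - top) for v in log_w]
            z[r][c] = rng.choices(range(K), weights=w)[0]
    return z

rng = random.Random(0)
truth = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
x = [[10.0 * t for t in row] for row in truth]       # noiseless, well-separated grey levels
z = [[rng.randrange(2) for _ in row] for row in truth]
for _ in range(5):
    gibbs_sweep_labels(x, z, means=[0.0, 10.0], variances=[1.0, 1.0], gamma=1.0, rng=rng)
```

In this toy setting the two classes are so well separated that the likelihood term dominates; on noisy images the neighbor count is what regularizes the segmentation.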

The main difficulty in these relations is that the joint distribution $p(z, \theta \mid x, K, \gamma)$
is not separable in its arguments. A framework that makes it possible to establish
interesting relations between these approaches is the approximation of this joint
distribution by a separable one, which leads to variational techniques.



VARIATIONAL BAYES 



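To be able to compare the two approaches, we consider

$$p(z, \theta \mid x, K) = p(x \mid z, \theta_1, K) \, \pi(z \mid \theta_2, K) \, \pi(\theta \mid K) / p(x \mid K) \qquad (23)$$

where $\theta_1 = \{\mu, v\}$, $\pi(\theta_1) = \pi(\mu) \, \pi(v) = \prod_k \pi(\mu_k) \, \pi(v_k)$ with
$\pi(\mu_k) = \mathcal{N}(\mu_k \mid \mu_0, v_0)$ and $\pi(v_k) = \mathcal{IG}(v_k \mid \alpha_0, \beta_0)$,
and where $\pi(z \mid \theta_2, K)$ is given either by

$$\pi(z \mid \alpha, K) = \prod_{i} P(Z_i = z_i \mid \alpha, K) = \prod_{k=1}^{K} \alpha_k^{\sum_i \delta(z_i - k)} \qquad (24)$$

where $\theta_2 = \alpha$, or by

$$\pi(z \mid \gamma, K) = \prod_{i} P(Z_i = z_i \mid z_{-i}, \gamma, K) \propto \prod_{i} \exp\Big\{ \gamma \sum_{j \in \mathcal{V}(i)} \delta(z_i - z_j) \Big\} \qquad (25)$$

where $\theta_2 = \gamma$, and where $\pi(\theta \mid K) = \pi(\theta_1 \mid K) \, \pi(\theta_2 \mid K)$.
In these equations,

$$p(x \mid K) = \sum_{z} \int p(x, z, \theta \mid K) \, d\theta = \sum_{z} \int p(x \mid z, \theta_1, K) \, \pi(z \mid \theta_2, K) \, \pi(\theta \mid K) \, d\theta \qquad (26)$$

is the evidence of the model $K$ and can be used to determine $K$.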



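Let us consider a free distribution $q(z, \theta)$ and compare it to the joint posterior
$p(z, \theta \mid x, K)$ and the complete data likelihood $p(x, z, \theta \mid K)$ via the two
following quantities:

• Free energy:

$$\begin{aligned}
\mathcal{F}(q(z,\theta) : p(x,z,\theta \mid K)) &= \sum_{z} \int q(z,\theta) \ln \frac{p(x,z,\theta \mid K)}{q(z,\theta)} \, d\theta \\
&= \sum_{z} \int q(z,\theta) \ln \frac{p(z,\theta \mid x,K) \, p(x \mid K)}{q(z,\theta)} \, d\theta \\
&= \sum_{z} \int q(z,\theta) \ln \frac{p(z,\theta \mid x,K)}{q(z,\theta)} \, d\theta + \ln p(x \mid K)
\end{aligned} \qquad (27)$$

• Kullback-Leibler relative entropy between the free distribution $q(z, \theta)$ and the joint
posterior $p(z, \theta \mid x, K)$:

$$\mathcal{K}(q(z,\theta) : p(z,\theta \mid x,K)) = \sum_{z} \int q(z,\theta) \ln \frac{q(z,\theta)}{p(z,\theta \mid x,K)} \, d\theta \qquad (28)$$

Then, we may note that

$$\ln p(x \mid K) - \mathcal{F}(q(z,\theta) : p(x,z,\theta \mid K)) = \mathcal{K}(q(z,\theta) : p(z,\theta \mid x,K)) \geq 0 \qquad (29)$$

so that the free energy $\mathcal{F}(q(z,\theta) : p(x,z,\theta \mid K))$ is a lower bound for
$\ln p(x \mid K)$. This also shows that minimizing $\mathcal{K}(q(z,\theta) : p(z,\theta \mid x,K))$
or maximizing $\mathcal{F}(q(z,\theta))$ leads to the same optimal solution
$q(z, \theta) = p(z, \theta \mid x, K)$, which is the joint posterior.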
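
Relation (29) can be verified numerically on a small tabulated model. In the sketch below, which is only an illustration, $\theta$ is replaced by a three-point grid so that the integral becomes a sum; the table `joint`, the distribution `q` and the function names are hypothetical.

```python
import math

# Hypothetical joint p(x, z, theta) at the observed x, tabulated over a
# 2-valued label z (rows) and a 3-point grid for theta (columns).
joint = [[0.10, 0.05, 0.02],
         [0.01, 0.04, 0.08]]
evidence = sum(sum(row) for row in joint)                  # p(x)
posterior = [[p / evidence for p in row] for row in joint]  # p(z, theta | x)

def free_energy(q, joint):
    """F(q : p) = sum q ln(p(x, z, theta) / q), the discrete analogue of (27)."""
    return sum(q[z][t] * math.log(joint[z][t] / q[z][t])
               for z in range(len(q)) for t in range(len(q[0])) if q[z][t] > 0)

def kl(q, p):
    """K(q : p) = sum q ln(q / p), the discrete analogue of (28)."""
    return sum(q[z][t] * math.log(q[z][t] / p[z][t])
               for z in range(len(q)) for t in range(len(q[0])) if q[z][t] > 0)

q = [[0.2, 0.2, 0.1],
     [0.1, 0.2, 0.2]]   # an arbitrary distribution summing to one
gap = math.log(evidence) - free_energy(q, joint)
```

The gap between the log-evidence and the free energy equals the Kullback-Leibler term and vanishes when `q` is set to the posterior itself.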

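These relations are valid for any $q(z, \theta)$ and in particular for a separable
$q(z, \theta) = q_1(z) \, q_2(\theta)$. This remark is the main idea behind the variational Bayes
method, which tries to approximate the joint non-separable distribution $p(z, \theta \mid x, K)$ by
a separable $q(z, \theta \mid x, K) = q_1(z) \, q_2(\theta)$, where $q_1$ and $q_2$ have to be
determined in such a way that either the Kullback-Leibler criterion $\mathcal{K}(q : p)$ is
minimized or the free energy $\mathcal{F}(q(z, \theta))$ is maximized. Noting that, with
$\mathcal{H}(q)$ the entropy of $q$,

$$\begin{aligned}
\mathcal{K}(q_1(z) \, q_2(\theta) : p(z,\theta \mid x,K)) &= \sum_{z} q_1(z) \left( \int q_2(\theta) \ln \frac{q_1(z) \, q_2(\theta)}{p(z,\theta \mid x,K)} \, d\theta \right) \\
&= \sum_{z} q_1(z) \left( \ln q_1(z) - \langle \ln p(z,\theta \mid x,K) \rangle_{q_2(\theta)} - \mathcal{H}(q_2) \right) \\
&= \int q_2(\theta) \left( \sum_{z} q_1(z) \ln \frac{q_1(z) \, q_2(\theta)}{p(z,\theta \mid x,K)} \right) d\theta \\
&= \int q_2(\theta) \left( \ln q_2(\theta) - \langle \ln p(z,\theta \mid x,K) \rangle_{q_1(z)} - \mathcal{H}(q_1) \right) d\theta
\end{aligned}$$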

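and the fact that $\mathcal{K}(q : p)$ is convex in $q_1$ for fixed $q_2$ and in $q_2$ for fixed
$q_1$, its optimization can be done in an iterative way:

$$\begin{cases}
q_1^{(t+1)} = \arg\min_{q_1} \{ \mathcal{K}(q_1 \, q_2^{(t)} : p) \} = \arg\max_{q_1} \{ \mathcal{F}(q_1 \, q_2^{(t)} : p) \} \\
q_2^{(t+1)} = \arg\min_{q_2} \{ \mathcal{K}(q_1^{(t)} \, q_2 : p) \} = \arg\max_{q_2} \{ \mathcal{F}(q_1^{(t)} \, q_2 : p) \}
\end{cases}$$

where $t$ denotes the iteration number. It is then easy to show that, at each iteration $t$, the
solution is obtained by computing the derivatives of the corresponding functionals and
equating them to zero, which leads to:

$$\begin{cases}
q_1^{(t+1)}(z) \propto \exp\left\{ \langle \ln p(x, z, \theta \mid K) \rangle_{q_2^{(t)}(\theta)} \right\} \\
q_2^{(t+1)}(\theta) \propto \exp\left\{ \langle \ln p(x, z, \theta \mid K) \rangle_{q_1^{(t)}(z)} \right\}
\end{cases} \qquad (30)$$

where $\langle \cdot \rangle_q$ denotes the expectation over $q$. For more details on this approach see [5, 6].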
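
The alternating updates can be tried on a small tabulated joint distribution. The sketch below is an illustration under assumptions: $\theta$ is replaced by a three-point grid, the table `joint` is hypothetical, and the $q_2$ update uses the freshly updated $q_1$ (a sequential variant of the two updates), which guarantees that the free energy does not decrease.

```python
import math

def normalise(weights):
    total = sum(weights)
    return [w / total for w in weights]

def free_energy(q1, q2, joint):
    """F(q1 q2 : p) = sum_z sum_t q1(z) q2(t) ln[ p(x, z, t) / (q1(z) q2(t)) ]."""
    return sum(q1[z] * q2[t] * math.log(joint[z][t] / (q1[z] * q2[t]))
               for z in range(len(q1)) for t in range(len(q2)))

def variational_bayes(joint, n_iter=20):
    """Alternate the two exponentiated-expectation updates on a tabulated joint."""
    n_z, n_t = len(joint), len(joint[0])
    q1, q2 = [1.0 / n_z] * n_z, [1.0 / n_t] * n_t
    history = []
    for _ in range(n_iter):
        # q1(z) proportional to exp < ln p(x, z, theta) >_{q2}
        q1 = normalise([math.exp(sum(q2[t] * math.log(joint[z][t]) for t in range(n_t)))
                        for z in range(n_z)])
        # q2(theta) proportional to exp < ln p(x, z, theta) >_{q1}
        q2 = normalise([math.exp(sum(q1[z] * math.log(joint[z][t]) for z in range(n_z)))
                        for t in range(n_t)])
        history.append(free_energy(q1, q2, joint))
    return q1, q2, history

joint = [[0.10, 0.05, 0.02],
         [0.01, 0.04, 0.08]]
q1, q2, history = variational_bayes(joint)
log_evidence = math.log(sum(sum(row) for row in joint))
```

The recorded free energies form a non-decreasing sequence bounded above by the log-evidence; the bound is not reached here because the tabulated posterior is not separable.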

Noting that $p(x, z, \theta \mid K) = p(x \mid z, \theta, K) \, \pi(z \mid \theta, K) \, \pi(\theta \mid K)$,
we see that the choice of the priors $\pi(z \mid \theta, K)$ and $\pi(\theta \mid K)$, as well as
the choice of the parametric families of $q_1(z)$ and $q_2(\theta)$, is of great importance for the
expressions of $q_1^{(t+1)}(z)$ and $q_2^{(t+1)}(\theta)$ and for their final values $q_1^*(z)$
and $q_2^*(\theta)$. To obtain a computationally effective inference method, it is necessary
to choose these distributions appropriately. For example, by choosing conjugate priors
$\pi(\theta \mid K)$ for the hyperparameters, we gain the advantage that the posterior
expressions $\pi(\theta \mid z, x, K)$ or $\pi(\theta \mid x, K)$ will be in the same family as the
associated priors.
Among the particular cases, we may mention the following:

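Optimal case: $q_1(z) = p(z \mid x, K)$ and $q_2(\theta) = p(\theta \mid x, K)$

This means that $p(z, \theta \mid x, K)$ is approximated by the product
$p(z \mid x, K) \, p(\theta \mid x, K)$. The solution in this case is immediate:

$$\begin{cases}
q_1^*(z) = p(z \mid x, K) = \int p(x \mid z, \theta, K) \, \pi(z \mid \theta, K) \, \pi(\theta \mid K) \, d\theta \, / \, p(x \mid K) \\
q_2^*(\theta) = p(\theta \mid x, K) = \sum_{z} p(x \mid z, \theta, K) \, \pi(z \mid \theta, K) \, \pi(\theta \mid K) \, / \, p(x \mid K)
\end{cases}$$

However, computing either of these two terms requires a marginalization (integration over
$\theta$ for the first and summation over $z$ for the second).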

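Degenerate case: $q_1(z) = p(z \mid \theta^*, x, K)$ and $q_2(\theta) = p(\theta \mid z^*, x, K)$,
where $\theta^*$ and $z^*$ are two point estimates obtained from $p(\theta \mid z, x, K)$ and
$p(z \mid \theta, x, K)$.
This case is obtained through the following iterations:

$$\begin{cases} q_1^{(t)}(z) = \delta(z - z^{(t)}) \\ q_2^{(t)}(\theta) = \delta(\theta - \theta^{(t)}) \end{cases}
\longrightarrow
\begin{cases} q_1^{(t+1)}(z) = p(z \mid \theta^{(t)}, x, K) \\ q_2^{(t+1)}(\theta) = p(\theta \mid z^{(t)}, x, K) \end{cases}$$

which means that $p(z, \theta \mid x, K)$ is approximated by the product
$p(z \mid \theta, x, K) \, p(\theta \mid z, x, K)$.
Both expressions are available up to their normalizing factors:

$$\begin{cases}
p(z \mid \theta, x, K) \propto p(x \mid z, \theta, K) \, \pi(z \mid \theta, K) \\
p(\theta \mid z, x, K) \propto p(x \mid z, \theta, K) \, \pi(\theta \mid K)
\end{cases}$$

However, computing $\theta^{(t)}$ and $z^{(t)}$ at each iteration, which may be either the means
or the modes of these two distributions, may still need some effort. In particular, the
computational cost and difficulty of the expression of $p(z \mid \theta, x, K)$ depend on the
prior $\pi(z \mid \theta, K)$: the separable case (24) is much easier than the Markovian
case (25).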
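Variational EM:

$q_1(z) = p(z \mid \theta^*, x, K)$ and $q_2(\theta) = p(\theta \mid x, K)$ where
$\theta^* = \arg\max_{\theta} \{ p(\theta \mid x, K) \}$.

This case is obtained through the following iterations:

$$q_2^{(t)}(\theta) = \delta(\theta - \theta^{(t)}) \longrightarrow
\begin{cases}
q_1^{(t+1)}(z) = p(z \mid \theta^{(t)}, x, K) \\
\theta^{(t+1)} = \arg\max_{\theta} \{ Q(\theta, \theta^{(t)}) \} & \text{(M step)} \\
\text{with } Q(\theta, \theta^{(t)}) = \langle \ln p(x, Z, \theta \mid K) \rangle_{p(z \mid \theta^{(t)}, x, K)} & \text{(E step)}
\end{cases}$$

The next step in approximations is to choose $q_1(z) = \prod_i q_{1i}(z_i)$ or
$q_2(\theta) = \prod_k q_{2k}(\theta_k)$ or both. The first case is only necessary for Markovian
models of the labels.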

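Mean Field + EM:

$q_1(z) = \prod_i p(z_i \mid z_{-i}, \theta^*, x, K)$ and $q_2(\theta) = p(\theta \mid x, K)$ where
$\theta^* = \arg\max_{\theta} \{ p(\theta \mid x, K) \}$.
This case is obtained through the following iterations:

$$\begin{cases}
\theta^{(t+1)} = \arg\max_{\theta} \{ Q(\theta, \theta^{(t)}) \} & \text{(M step)} \\
\text{with } Q(\theta, \theta^{(t)}) = \langle \ln p(x, Z, \theta \mid K) \rangle_{q_1^{(t)}(z)}, \quad q_1^{(t)}(z) = \prod_i p(z_i \mid z_{-i}, \theta^{(t)}, x, K) & \text{(E step)}
\end{cases}$$

where

$$p(z_i \mid z_{-i}, \theta^{(t)}, x, K) \propto p(x \mid z, \theta^{(t)}, K) \, p(z_i \mid z_{-i}, \theta^{(t)})$$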
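
One common way of making this factorized label distribution computable is to replace the neighboring labels by their current expectations. The sketch below follows that reading, which is an assumption and not necessarily the scheme used by the authors; it works on a one-dimensional chain with neighbors $i-1$ and $i+1$, with $\theta$ and $\gamma$ fixed, and the signal values and the function name `mean_field_potts_1d` are hypothetical.

```python
import math

def mean_field_potts_1d(x, means, variances, gamma, n_iter=10):
    """Mean-field label probabilities q[i][k] on a chain, theta and gamma fixed.

    Update: q_i(k) proportional to N(x_i | mu_k, v_k) * exp(gamma * sum_j q_j(k)),
    where j runs over the chain neighbours i-1 and i+1 (assumed mean-field form).
    """
    n, K = len(x), len(means)
    q = [[1.0 / K] * K for _ in range(n)]
    for _ in range(n_iter):
        for i in range(n):
            log_w = []
            for k in range(K):
                neigh = sum(q[j][k] for j in (i - 1, i + 1) if 0 <= j < n)
                log_lik = (-0.5 * math.log(2.0 * math.pi * variances[k])
                           - 0.5 * (x[i] - means[k]) ** 2 / variances[k])
                log_w.append(log_lik + gamma * neigh)
            top = max(log_w)                      # subtract the max for numerical stability
            w = [math.exp(v - top) for v in log_w]
            s = sum(w)
            q[i] = [v / s for v in w]
    return q

x = [0.1, -0.2, 0.9, 0.0, 3.1, 2.8, 1.2, 3.0]    # hypothetical 1-D signal
q = mean_field_potts_1d(x, means=[0.0, 3.0], variances=[1.0, 1.0], gamma=1.5)
labels = [max(range(2), key=lambda k: qi[k]) for qi in q]
```

Unlike the Gibbs sampler, these updates are deterministic, and an ambiguous sample such as the value 1.2 above is classified using the soft labels of its neighbors as well as its own likelihood.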
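
Mean Field + separable EM:

$q_1(z) = p(z \mid \theta^*, x, K)$ and $q_2(\theta) = \prod_k q_{2k}(\theta_k \mid x, K)$ where
$\theta_k^* = \arg\max_{\theta_k} \{ p(\theta_k \mid x, K) \}$

and where

$$\begin{cases}
q_2(\theta) = \prod_k q_{2k}(\alpha_k) \, q_{2k}(\mu_k) \, q_{2k}(v_k) & \text{for Dirichlet} \\
q_2(\theta) = q_2(\gamma) \prod_k q_{2k}(\mu_k) \, q_{2k}(v_k) & \text{for Potts}
\end{cases}$$

Totally separable (Mean Field):

$q_1(z) = \prod_i p(z_i \mid z_{-i}, \theta^*, x, K)$ and
$q_2(\theta) = \prod_k q_{2k}(\alpha_k) \, q_{2k}(\mu_k) \, q_{2k}(v_k)$.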



APPLICATIONS IN DATA CLASSIFICATION AND IN IMAGE
SEGMENTATION

Mixture of Gaussians models are natural models for data classification. When the
data are the scalar grey levels $x_i$ or the color components $\mathbf{x}_i$ of the pixels of an
image, their classification results in the segmentation of that image. In the following, we denote
by $r_i$ the coordinate position of the pixel $i$, by $x_i = x(r_i)$ its grey level, by
$z_i = z(r_i)$ its classification label and by $\mathcal{R}_k = \{r_i : z(r_i) = k\}$ all the
disjoint compact regions having the same label. We assume that these regions are mutually
disjoint, $\mathcal{R}_k \cap \mathcal{R}_l = \emptyset$ for $k \neq l$, and that
$\cup_k \mathcal{R}_k = \mathcal{R}$, so that they cover the whole image. What is more specific to
image segmentation, compared to other data classification problems, is the fact that there is a
spatial organization of the data. A MoG model with a Dirichlet prior does not account for this
spatial organization, but the same MoG with a Markovian Potts prior does. There are also many
other possibilities for modeling this spatial organization, but the Potts Markov model is
probably the simplest one. To illustrate this, let us consider the image of Figure 1-a)
and its original labels in b). The histogram of the image pixels is shown in c) and
the histogram of the labels in d). The results of segmentation using MoG with a Dirichlet
prior are shown in e) (supervised) and in f) (unsupervised). The results of segmentation
using MoG with a Potts prior are shown in g) (supervised) and in h) (unsupervised). We
observe that, in both the supervised and the unsupervised cases, the results with the Potts
prior are better than those with the Dirichlet prior.

FIGURE 1. Classification and segmentation: a) an image, b) its original classification labels,
c) histogram of the pixels, d) histogram of the labels, e) and f) segmentation using MoG with
Dirichlet (supervised and unsupervised); g) and h) segmentation using MoG with Potts (supervised
and unsupervised); i), j), k), l) are the classification errors (differences between the original
classification in b) and the obtained classifications in e), f), g) and h)).



CONCLUSION 



Mixture of Gaussians models are used extensively for data classification and image
segmentation. When there is no prior knowledge of any spatial organization of the data,
the Dirichlet prior can be used. However, in image segmentation, this model often does 
not give satisfactory results, because the spatial organization of the pixels is ignored. 
Using the Potts prior gives better results because this prior accounts for the spatial 
organization of the pixels. 